[RFC] Wire contract + mechanical wire-compat CI gate (complement to #1920)#2
Closed
silas-scitix wants to merge 1 commit into
Closed
[RFC] Wire contract + mechanical wire-compat CI gate (complement to #1920)#2silas-scitix wants to merge 1 commit into
silas-scitix wants to merge 1 commit into
Conversation
…vcache-ai#1920) Adds a draft RFC plus a working golden type-code gate that detects RPC wire breaks before merge. Complements kvcache-ai#1920 (Rolling Upgrade) by supplying the enforcement kvcache-ai#1920 leaves to code review: a machine-checkable frozen wire contract (function-ids + arg-tuple type-codes + wire structs), a pre-merge CI gate, a version-gate redesign, and a cross-version interop test methodology. Motivated by the kvcache-ai#2288 silent wire break (a bare trailing tenant_id shifted ~30 handler arg type-codes; nothing in CI caught it). Docs + gate tool only, no production code changed.
Collaborator
Author
|
Superseded by the upstream PR kvcache-ai#2579. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Draft RFC + a working golden type-code CI gate that mechanically detects RPC wire breaks before merge. Positioned as a complement to kvcache-ai#1920 (Rolling Upgrade Support): kvcache-ai#1920 owns the rolling-upgrade policy/features; this RFC owns the contract definition and its enforcement -- the part kvcache-ai#1920 explicitly leaves to "code review".
Why this matters in production (and why I want to land it)
The Mooncake Store master is on the hot path for every prefill: thousands of per-rank clients depend on it for KV-cache metadata. A wire break there is not a localized bug -- it is a fleet-wide outage with a specific, expensive failure mode:
errc::invalid_rpc_arguments), so an upgraded client talking to a not-yet-upgraded master (or vice versa) simply fails. "Upgrade 10% and watch" is impossible; master and every client must switch in one atomic step.function_id = MD5Hash32(name),arg_type_code = MD5Hash32(layout) & 0xFFFFFFFE), derived from source layout, not from any declared version. An additive change can shift the wire while compiling cleanly and passing review -- so "enforce via code review" is not enforcement.A concrete upcoming case we are watching
This is not hypothetical for us.
#2288("Propagate tenant identity through object RPCs") appends a bare trailingtenant_idto ~30WrappedMasterServicehandlers, which shifts their struct_pack argument type-codes (e.g.PutStart 0xfad0c534 -> 0x22f8edba). As soon as that lands in a release, av0.3.11peer and the new master hard-reject each other's affected RPCs -- i.e. it breaks rolling upgrade across that boundary. It is a live case heading for the next version that will hit our production fleet, and exactly the class of change a mechanical gate is meant to surface (intentionally, as a reviewable red check) rather than ship silently.The value is concrete: a frozen wire contract + a pre-merge gate turns "did we break the wire?" from a production incident into a red CI check, and the runtime version-digest turns silent same-version drift between deployed peers into an observable signal. That is what makes rolling upgrades actually safe to run -- the whole promise of kvcache-ai#1920. I am strongly motivated to get this compatibility design landed in production, not just discussed, and happy to do the work to make it production-ready: maintain the golden contract, wire the gate as a required check, build out the cross-version interop matrix, adapt to whatever shape the maintainers prefer, and help land the forward-compatible serialization pieces of kvcache-ai#1920's Phase 1 on top of this enforcement layer.
What's in this PR (docs + gate tool only, NO production code changed)
WIRE-COMPAT-RFC.md-- the proposal: frozen wire contract, pre-merge CI gate, version-gate redesign, evolution rules, interop methodology.gate/-- a small generator + checker that emits every handler'sfunction_id+arg_type_code(+ wire structs) into a checked-in golden file and fails on any unintended drift. Demonstrated: clean tree = green (59 handlers); inject a bare-trailing-arg of the#2288shape onPutEnd= red, naming the handler with old->new code (0x61232de0 -> 0x1b7a228a, function-id unchanged); revert = green again. Built against vendored ylt 0.6.0.ci/wire-compat-gate.yml-- GitHub Actions wiring the checker as a required check on PRs touching the RPC surface.Version-gate redesign (Section 4)
Layers on kvcache-ai#1920's proposed structured
ServiceReady: adds a wire-contract digest derived from code (not a hand-edited constant), so same-version wire drift becomes observable instead of silent. The digest is the runtime mirror of the CI gate -- the gate prevents drift pre-merge, the digest surfaces drift between already-deployed peers.Note
Internal sample PR (fork-only) for review before any upstream proposal.